feat(arrow): expose Arrow IPC reader via registerArrow and readArrow #52
Open
LantaoJin wants to merge 1 commit into
Conversation
Closes apache#37

Mirrors the existing parquet/csv/json reader pattern for Arrow IPC files. Adds:

- proto/arrow_read_options.proto with the ArrowReadOptionsProto message (file_extension only; an explicit Arrow schema rides on the existing schema-IPC byte channel rather than this proto, matching the other formats)
- ArrowReadOptions Java builder with fileExtension default ".arrow" and schema(Schema)
- SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads with null-argument validation per Andy's apache#47 review feedback
- native/src/arrow.rs JNI module that decodes the proto and dispatches to upstream SessionContext::register_arrow / read_arrow

Note on ArrowReadOptions construction: upstream's ArrowReadOptions exposes file_extension as a public field (not a builder setter), unlike the other format options. The native side uses struct-update syntax to set it without tripping clippy's field_reassign_with_default lint.

Tests cover proto round-trip, schema-by-reference, register/read on a fixture written by arrow-vector's ArrowFileWriter (the canonical Arrow IPC file format DataFusion's source supports), custom file extension, explicit Arrow schema, and null-argument rejection on both register and read.

Out of scope: tablePartitionCols (no parquet/csv/json analog on the Java side yet). Arrow IPC carries body compression inside the file format itself, so unlike CSV and NDJSON there is no FileCompressionType on this options class.
Which issue does this PR close?

Closes #37.
Rationale for this change
DataFusion 53.x supports Arrow IPC files via
SessionContext::register_arrow / read_arrow, but the Java bindings only expose Parquet, CSV, and (in #47) NDJSON. Since JVM results already come back as Arrow batches via the C Data Interface, an Arrow IPC reader on the Java side closes the natural round-trip: Java callers can write Arrow IPC to disk with arrow-vector's ArrowFileWriter, then read it back through DataFusion without going through Parquet or any other intermediate format. Today they have to fall back to CREATE EXTERNAL TABLE … STORED AS ARROW via SQL, which works but bypasses the typed builder pattern.

This PR is the Java surface for the existing upstream functionality. Issue #37 tracks it; the implementation follows the same proto-over-JNI pattern as #47 (NDJSON), #29 (the CSV/Parquet refactor), and the merged CSV/Parquet readers.
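The typed builder pattern mentioned above can be sketched as follows. This is a standalone mock of the shape the PR describes (default ".arrow", a fileExtension setter), not the actual bindings code; the schema(Schema) setter is omitted so the sketch stays self-contained.

```java
// Standalone sketch of the ArrowReadOptions builder shape described in this PR.
// Only the fileExtension handling is mirrored here (default ".arrow"); the real
// class also carries a schema(Schema) setter backed by Arrow's Schema type.
final class ArrowReadOptions {
    private final String fileExtension;

    private ArrowReadOptions(Builder builder) {
        this.fileExtension = builder.fileExtension;
    }

    String fileExtension() {
        return fileExtension;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String fileExtension = ".arrow"; // default per the PR description

        Builder fileExtension(String fileExtension) {
            this.fileExtension = fileExtension;
            return this;
        }

        ArrowReadOptions build() {
            return new ArrowReadOptions(this);
        }
    }
}
```

A caller that never touches the builder gets the ".arrow" default; setting a custom extension is a single chained call before build().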
What changes are included in this PR?
- proto/arrow_read_options.proto: new ArrowReadOptionsProto message. Single field: file_extension (default .arrow). An explicit Arrow schema rides on the existing IPC byte channel through the JNI layer, mirroring the parquet/csv/json paths, and is therefore not encoded in this message. No FileCompressionType field: Arrow IPC files carry body compression (LZ4_FRAME / ZSTD per-buffer) inside the file format itself.
- ArrowReadOptions Java builder with fileExtension(String) and schema(Schema) setters.
- SessionContext.registerArrow(name, path[, options]) and readArrow(path[, options]) overloads, structurally identical to the parquet/csv/json entry points.
- native/src/arrow.rs: JNI module that decodes ArrowReadOptionsProto, constructs the upstream ArrowReadOptions, and forwards to register_arrow / read_arrow. Imports ArrowReadOptions from datafusion::execution::options rather than prelude (it's not re-exported there, same situation as JsonReadOptions).

Out of scope (for follow-ups):
- tablePartitionCols: neither parquet, csv, nor ndjson exposes Hive-style partitioning on the Java side yet. Adding it for Arrow only would diverge.

Are these changes tested?
Yes, 9 new tests across ArrowReadOptionsTest and SessionContextArrowTest.

Are there any user-facing changes?
Yes, purely additive. New public API:

- org.apache.datafusion.ArrowReadOptions
- SessionContext.registerArrow(String, String)
- SessionContext.registerArrow(String, String, ArrowReadOptions)
- SessionContext.readArrow(String) → DataFrame
- SessionContext.readArrow(String, ArrowReadOptions) → DataFrame

The new org.apache.datafusion.protobuf.ArrowReadOptionsProto generated class is also exposed via the protobuf-Java output, consistent with how CsvReadOptionsProto, NdJsonReadOptionsProto, and ParquetReadOptionsProto are exposed. No API removals, no deprecations, no behavior change for existing callers.
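The overload-plus-validation pattern the new entry points follow can be sketched as below. The real SessionContext dispatches over JNI, so this standalone mock only illustrates the delegation from the short overload to the full one and the up-front null checks; the method names come from the PR, while the body is a hypothetical stand-in.

```java
import java.util.Objects;

// Standalone sketch of the registerArrow entry-point pattern: the two-argument
// overload delegates to the three-argument one with default options, and the
// full overload validates its arguments up front (per the #47 review feedback).
// The JNI dispatch is replaced by a comment so the sketch is self-contained.
class SessionContextSketch {
    void registerArrow(String name, String path) {
        // Stand-in object for a default-constructed ArrowReadOptions.
        registerArrow(name, path, new Object());
    }

    void registerArrow(String name, String path, Object options) {
        Objects.requireNonNull(name, "name must not be null");
        Objects.requireNonNull(path, "path must not be null");
        Objects.requireNonNull(options, "options must not be null");
        // Real implementation: serialize options to ArrowReadOptionsProto bytes
        // and invoke the native register method over JNI.
    }
}
```

Failing fast with NullPointerException on a null table name or path keeps the error on the Java side instead of surfacing it as an opaque native-layer failure.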